Fix exception ref-count leak in non-ref catch handlers. by cfallin · Pull Request #12860 · bytecodealliance/wasmtime

cfallin · 2026-03-27T23:17:38Z

When a Wasm throw instruction executes, the throw_ref libcall was cloning the GC ref (incrementing the refcount), but the catch handler never decremented it. This caused every caught exception to leak, leading to unbounded GC heap growth in throw/catch loops.

Two fixes:

Remove the unnecessary clone_gc_ref() in throw_ref. The throw/throw_ref instructions consume the exnref operand, so ownership transfers naturally to pending_exception without cloning.
In create_catch_block, emit a drop_gc_ref call for non-ref catches (Catch::One, Catch::All) after field extraction. These catches consume the exnref without passing it to the branch target, so the refcount must be decremented.

Also adds Store::gc_heap_size() / StoreContext::gc_heap_size() accessors and a throw_catch_many_times integration test that throws and catches 100K exceptions in a loop, asserting the GC heap stays within a single 64 KiB page.

When a Wasm `throw` instruction executes, the `throw_ref` libcall was cloning the GC ref (incrementing the refcount), but the catch handler never decremented it. This caused every caught exception to leak, leading to unbounded GC heap growth in throw/catch loops. Two fixes: 1. Remove the unnecessary `clone_gc_ref()` in `throw_ref`. The `throw`/`throw_ref` instructions consume the exnref operand, so ownership transfers naturally to `pending_exception` without cloning. 2. In `create_catch_block`, emit a `drop_gc_ref` call for non-ref catches (`Catch::One`, `Catch::All`) after field extraction. These catches consume the exnref without passing it to the branch target, so the refcount must be decremented. Also adds `Store::gc_heap_size()` / `StoreContext::gc_heap_size()` accessors and a `throw_catch_many_times` integration test that throws and catches 100K exceptions in a loop, asserting the GC heap stays within a single 64 KiB page.

fitzgen · 2026-03-30T11:56:20Z

crates/cranelift/src/func_environ/gc/enabled.rs

+/// Drop a GC reference to an exception object, decrementing its ref count.
+///
+/// This should be called when a `catch` (non-ref) handler has finished
+/// extracting fields from the caught exnref and no longer needs it.
+pub fn translate_drop_exnref(
+    func_env: &mut FuncEnvironment<'_>,
+    builder: &mut FunctionBuilder<'_>,
+    exnref: ir::Value,
+) -> WasmResult<()> {
+    log::trace!("translate_drop_exnref({exnref:?})");
+    match func_env.tunables.collector {
+        #[cfg(feature = "gc-drc")]
+        Some(Collector::DeferredReferenceCounting) => {
+            let drop_gc_ref_libcall = func_env.builtin_functions.drop_gc_ref(builder.func);
+            let vmctx = func_env.vmctx_val(&mut builder.cursor());
+            builder.ins().call(drop_gc_ref_libcall, &[vmctx, exnref]);
+        }
+        // The null collector doesn't track ref counts, so nothing to do.
+        _ => {
+            // Avoid unused-parameter warning.
+            let _ = builder;
+        }
+    }
+    Ok(())
+}


A couple things:

If we add a new collector that requires some kind of explicit drop here, we will silently repeat this bug because of the wildcard. We shouldn't have wildcards for this kind of thing.

Even better would be to move this into a trait method on the GC compiler, as the dual of GcCompiler::alloc_exn.

fitzgen · 2026-03-30T12:06:44Z

crates/cranelift/src/translate/code_translator.rs

+    } else {
+        // For non-ref catches, the exnref is consumed here and not
+        // passed to the branch target.
+        environ.translate_drop_exnref(builder, exn_ref)?;


I don't quite understand. The GC ref should logically be held alive by the Wasm stack (in the over-approx table) and then get dropped upon the next GC. Why wasn't that happening? Are we missing a call to expose_gc_ref_to_wasm in the throw code? (Probably)

If so, then we should do that instead.

We do call expose_gc_ref_to_wasm here when we manually convert to a wasmtime::ValRaw but you're right, I think it's missing in the throw path. I'll dig deeper -- thanks!

cfallin · 2026-03-30T19:41:44Z

So digging into this a bit more I see that we always expose_gc_ref_to_wasm in the gc_alloc_raw libcall, which is used both by the exn throw path and other allocation paths (e.g. for GC structs): here

The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably as we play with semantics, maybe a leak in practice?) The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have only at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small.

I had short-circuited the "deferred" part of DRC here with explicit drops which I agree is the wrong approach now that I dig in more. But it should also be the case that a C program using SjLj with exceptions should not grow the GC heap to its max size (4GiB say?) before ever collecting -- that seems like an objectively bad heuristic choice.

Perhaps what we need is a collection heuristic like what we do for OwnedRooted, where we have a watermark based on the actual size after last collection, and collect again at twice that size? That allows growth up to a large-working-set regime without more than amortized-constant overhead, but should still bound the working set size pretty tightly. What do you think @fitzgen?

fitzgen · 2026-03-30T20:34:51Z

So digging into this a bit more I see that we always expose_gc_ref_to_wasm in the gc_alloc_raw libcall, which is used both by the exn throw path and other allocation paths (e.g. for GC structs): here

It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here:

wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs

Lines 1740 to 1753 in 5d52f56

    
           #[cfg(feature = "gc")] 
        
           fn throw_ref( 
        
               store: &mut dyn VMStore, 
        
               _instance: InstanceId, 
        
               exnref: u32, 
        
           ) -> Result<(), TrapReason> { 
        
               let exnref = VMGcRef::from_raw_u32(exnref).ok_or_else(|| Trap::NullReference)?; 
        
               let exnref = store.unwrap_gc_store_mut().clone_gc_ref(&exnref); 
        
               let exnref = exnref 
        
                   .into_exnref(&*store.unwrap_gc_store().gc_heap) 
        
                   .expect("gc ref should be an exception object"); 
        
               store.set_pending_exception(exnref); 
        
               Err(TrapReason::Exception) 
        
           }

I'm pretty sure that adding a call to expose_gc_ref_to_wasm there will fix the leak.

The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably as we play with semantics, maybe a leak in practice?) The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have only at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small.

Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Instead, we should fix the heuristics and make it so that exnrefs Just Work the way we want them to, and that falls out automatically from updating those heuristics. That, and then also the other stuff in #11256, in the fullness of time.

In the meantime, you can update the test to use a resource limiter, custom memory creator, or the pooling allocator to constrain the max size of memories (and therefore also the GC heap) and it should still exercise the leak-checking even without the grow-or-collect heuristics fixed. The pooling allocator is probably the easiest.

cfallin · 2026-03-30T20:48:12Z

It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here:

I'm a bit confused: the code that you link is a libcall that takes a ref from Wasm and pulls it into the host (to save on the store). Did you mean to link somewhere else?

Separately: writing a variant of this test, that allocates a GC struct and immediately drops it in a loop, causes the GC heap to grow without collection up to its size limit in the same way. That seems to confirm to me that this is not exn-specific but rather a general GC growth heuristic problem?

Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing exnrefs.

Yes, agreed; the next paragraph of my earlier comment suggests a growth heuristic (for the whole GC heap, not just for exnref-specific cases), exactly as you're saying. (In other words: I am definitely not proposing an exceptions-specific change.) I would imagine it could/should become the default growth heuristic. What do you think of the proposed heuristic?

fitzgen · 2026-03-30T20:59:15Z

Ah sorry I got confused because I had assumed that this was not about the heuristics and was something specific to the interaction between exnref and libcalls, and I was trying not to complicate the discussion by getting into multiple things at once. That backfired :-p

Perhaps what we need is a collection heuristic like what we do for OwnedRooted, where we have a watermark based on the actual size after last collection, and collect again at twice that size? That allows growth up to a large-working-set regime without more than amortized-constant overhead, but should still bound the working set size pretty tightly. What do you think @fitzgen?

Yes, this is roughly what I've been imagining, although there is the minor wrinkle of growing in units of pages, but wanting to do the accounting at a finer-grained level so that the single-page GC heap use case works correctly.

To be precise, what I've had in my head is this:

After each collection we record the live set size somewhere -- maybe GcStore?
When deciding whether to GC or grow the heap, we look at the last live set size, and if it is less than or equal to half our GC heap size, then we do a collection. Otherwise, we grow.
The initial live set size is zero, which causes us to always GC before growing, which should preserve the single-page GC heap use case, because if there is only ever one exnref live then our live set should always end up being below half the GC heap size.

This should also avoid doing a bunch of GCs for each ~power of two on the way up to 64KiB, which is just wasting cycles.

How does that sound?

fitzgen · 2026-03-30T21:00:56Z

After each collection we record the live set size somewhere -- maybe GcStore?

And because refcounting operates on anti-matter rather than matter, this does unfortunately mean that the DRC collector will need to always have a running count of allocated bytes, rather than being able to compute it just at GC time like the copying collector will be able to. Ah well, not a big deal.

cfallin · 2026-03-30T21:04:19Z

That does sound reasonable, and I'm happy to work on that -- thanks! Closing this in the meantime (I'll bring back both the sjlj exception test and struct-alloc-then-drop test once we have said heuristic and assert they both stay within one page).

cfallin requested review from a team as code owners March 27, 2026 23:17

cfallin requested review from fitzgen and removed request for a team March 27, 2026 23:17

github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Mar 28, 2026

cfallin force-pushed the exception-leak branch from eebdba7 to d851296 Compare March 28, 2026 15:00

cfallin force-pushed the exception-leak branch from d851296 to e886233 Compare March 28, 2026 15:17

fitzgen reviewed Mar 30, 2026

View reviewed changes

Add test demonstrating GC growth with small working set as well

7214db6

cfallin closed this Mar 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix exception ref-count leak in non-ref catch handlers.#12860

Fix exception ref-count leak in non-ref catch handlers.#12860
cfallin wants to merge 2 commits intobytecodealliance:mainfrom
cfallin:exception-leak

cfallin commented Mar 27, 2026

Uh oh!

fitzgen Mar 30, 2026

Uh oh!

fitzgen Mar 30, 2026

Uh oh!

cfallin Mar 30, 2026

Uh oh!

cfallin commented Mar 30, 2026

Uh oh!

fitzgen commented Mar 30, 2026

Uh oh!

cfallin commented Mar 30, 2026 •

edited

Loading

Uh oh!

fitzgen commented Mar 30, 2026

Uh oh!

fitzgen commented Mar 30, 2026

Uh oh!

cfallin commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cfallin commented Mar 27, 2026

Uh oh!

fitzgen Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

fitzgen Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

cfallin Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

cfallin commented Mar 30, 2026

Uh oh!

fitzgen commented Mar 30, 2026

Uh oh!

cfallin commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fitzgen commented Mar 30, 2026

Uh oh!

fitzgen commented Mar 30, 2026

Uh oh!

cfallin commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cfallin commented Mar 30, 2026 •

edited

Loading